In this video, you'll learn how to control the
model and what it generates.
For many, this is the most exciting piece because you
get to tell the model what you want and it gets to imagine it for you.
When it comes to controlling these models,
we want to use embeddings. We looked at embeddings
a little bit in previous videos, with the
time embedding and the context embedding.
Embeddings are vectors: numbers that are able
to capture meaning. And here
the embedding is capturing the meaning of this sentence, or
perhaps this joke about diffusion models: "Brownians often bump
into each other". It encodes that text into this embedding, which is this
set of numbers in a vector.
And what's special about embeddings is that, because they capture
semantic meaning, text with similar content will
have similar vectors. And one of the kind
of magical things about embeddings is you can
almost do vector arithmetic with them: for example, the "Paris" embedding
minus the "France" embedding plus the "England" embedding
is approximately the "London" embedding.
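As a toy illustration of that arithmetic (the three-dimensional vectors below are made-up numbers, not real learned embeddings; real text embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

# Made-up 3-dimensional "embeddings", purely illustrative. Imagine the
# dimensions loosely encode (is-a-city, England-ness, France-ness).
paris   = np.array([1.0, 0.2, 0.5])
france  = np.array([0.0, 0.2, 0.5])
england = np.array([0.0, 0.9, 0.1])
london  = np.array([1.0, 0.9, 0.1])

# "Paris - France + England" should land near the "London" vector
result = paris - france + england

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(result, london))  # ~1.0: the arithmetic recovers "London"
```

With real embedding models the match is approximate rather than exact, but the same nearest-neighbor idea applies.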
Okay, so how do these embeddings actually become context to the model during training? Well, here you have an avocado image that you want the neural network to learn from, and you also have a caption for it: "a ripe avocado". You can pass that caption through an embedding model, get an embedding, and input that into the neural network, which then predicts the noise that was added to this avocado image. You compute the loss and do the same thing as before. And you can do this across a lot of different images with captions. So here is "a comfy armchair": you pass it through to get an embedding, feed that into the model, and have it be part of training.
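A minimal sketch of that training step. Everything here is a stand-in for illustration: `TinyNoisePredictor` is a hypothetical small MLP rather than the lab's U-Net, the "images" are random tensors, and the noising is simplified (a real DDPM scales image and noise by the schedule). Only the pattern matters: embed the context, feed it in alongside the noised image, and train the network to predict the noise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in for the U-Net: an MLP that takes a flattened
# noisy image concatenated with a context embedding, and predicts noise.
class TinyNoisePredictor(nn.Module):
    def __init__(self, img_dim=16 * 16 * 3, ctx_dim=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + ctx_dim, 128), nn.ReLU(),
            nn.Linear(128, img_dim),
        )

    def forward(self, x, c):
        return self.net(torch.cat([x, c], dim=1))

model = TinyNoisePredictor()
x0 = torch.randn(4, 16 * 16 * 3)                        # clean "images" (random stand-ins)
ctx = F.one_hot(torch.tensor([0, 1, 2, 3]), 5).float()  # context embedding per image
noise = torch.randn_like(x0)
x_noised = x0 + noise            # simplified noising; real DDPM uses the ab_t schedule

pred = model(x_noised, ctx)      # predict the noise, conditioned on context
loss = F.mse_loss(pred, noise)   # same loss as before: predicted vs. true noise
loss.backward()
```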
Now, the magic of this section is that while you were able to
scrape these images of avocados and armchairs off the
internet with those captions, at sample time you're able
to generate things that the model has
never seen before. And that could be an "avocado
armchair". The magic here is that
you can embed the words "avocado armchair" into
an embedding that has, you know, a bit of avocado in there and a
bit of armchair in there, put that through the neural network, have
it predict the noise, subtract that noise out, and get, lo
and behold, an "avocado armchair".
So more broadly, context is a vector that can control generation.
Context can be, as we have just seen,
the text embedding of that "avocado armchair", which is very long. But
context doesn't have to be that long.
Context can also be different categories that are
five in length, you know, five different dimensions, such as
having a hero or a non-hero like these objects of a fireball and
a mushroom. It could be food items: apple, orange, watermelon.
It could be spells and weapons, like
this bow and arrow or this candle. And finally, it
could be whether these sprites are side-facing or not.
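For the sprite data, those five categories become simple one-hot context vectors. A small sketch (the category names and their order are the ones this lab assumes; `make_context` is a hypothetical helper, not part of the lab code):

```python
import torch
import torch.nn.functional as F

# The five category names this lab uses, in this assumed order:
categories = ["hero", "non-hero", "food", "spell", "side-facing"]

def make_context(names):
    """Turn a list of category names into a batch of one-hot context vectors."""
    idx = torch.tensor([categories.index(n) for n in names])
    return F.one_hot(idx, num_classes=len(categories)).float()

ctx = make_context(["hero", "food", "side-facing"])
print(ctx)
# tensor([[1., 0., 0., 0., 0.],
#         [0., 0., 1., 0., 0.],
#         [0., 0., 0., 0., 1.]])
```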
So now let's take a look at adding context to the model in this lab.
from typing import Dict, Tuple
from tqdm import tqdm
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import models, transforms
from torchvision.utils import save_image, make_grid
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter
import numpy as np
from IPython.display import HTML
from diffusion_utilities import *
class ContextUnet(nn.Module):
    def __init__(self, in_channels, n_feat=256, n_cfeat=10, height=28):  # cfeat - context features
        super(ContextUnet, self).__init__()

        # number of input channels, number of intermediate feature maps and number of classes
        self.in_channels = in_channels
        self.n_feat = n_feat
        self.n_cfeat = n_cfeat
        self.h = height  # assume h == w. must be divisible by 4, so 28, 24, 20, 16...

        # Initialize the initial convolutional layer
        self.init_conv = ResidualConvBlock(in_channels, n_feat, is_res=True)

        # Initialize the down-sampling path of the U-Net with two levels
        self.down1 = UnetDown(n_feat, n_feat)      # down1 #[10, 256, 8, 8]
        self.down2 = UnetDown(n_feat, 2 * n_feat)  # down2 #[10, 256, 4, 4]

        # original: self.to_vec = nn.Sequential(nn.AvgPool2d(7), nn.GELU())
        self.to_vec = nn.Sequential(nn.AvgPool2d(4), nn.GELU())

        # Embed the timestep and context labels with a one-layer fully connected neural network
        self.timeembed1 = EmbedFC(1, 2 * n_feat)
        self.timeembed2 = EmbedFC(1, 1 * n_feat)
        self.contextembed1 = EmbedFC(n_cfeat, 2 * n_feat)
        self.contextembed2 = EmbedFC(n_cfeat, 1 * n_feat)

        # Initialize the up-sampling path of the U-Net with three levels
        self.up0 = nn.Sequential(
            nn.ConvTranspose2d(2 * n_feat, 2 * n_feat, self.h // 4, self.h // 4),  # up-sample
            nn.GroupNorm(8, 2 * n_feat),  # normalize
            nn.ReLU(),
        )
        self.up1 = UnetUp(4 * n_feat, n_feat)
        self.up2 = UnetUp(2 * n_feat, n_feat)

        # Initialize the final convolutional layers to map to the same number of channels as the input image
        self.out = nn.Sequential(
            nn.Conv2d(2 * n_feat, n_feat, 3, 1, 1),  # reduce number of feature maps  (in_channels, out_channels, kernel_size, stride, padding)
            nn.GroupNorm(8, n_feat),  # normalize
            nn.ReLU(),
            nn.Conv2d(n_feat, self.in_channels, 3, 1, 1),  # map to same number of channels as input
        )

    def forward(self, x, t, c=None):
        """
        x : (batch, in_channels, h, w) : input image
        t : (batch, 1)                 : time step
        c : (batch, n_cfeat)           : context label
        """
        # pass the input image through the initial convolutional layer
        x = self.init_conv(x)
        # pass the result through the down-sampling path
        down1 = self.down1(x)      # [10, 256, 8, 8]
        down2 = self.down2(down1)  # [10, 256, 4, 4]

        # convert the feature maps to a vector and apply an activation
        hiddenvec = self.to_vec(down2)

        # if no context is given, default to a zero vector (unconditional generation)
        if c is None:
            c = torch.zeros(x.shape[0], self.n_cfeat).to(x)

        # embed context and timestep
        cemb1 = self.contextembed1(c).view(-1, self.n_feat * 2, 1, 1)  # (batch, 2*n_feat, 1, 1)
        temb1 = self.timeembed1(t).view(-1, self.n_feat * 2, 1, 1)
        cemb2 = self.contextembed2(c).view(-1, self.n_feat, 1, 1)
        temb2 = self.timeembed2(t).view(-1, self.n_feat, 1, 1)
        # print(f"uunet forward: cemb1 {cemb1.shape}. temb1 {temb1.shape}, cemb2 {cemb2.shape}. temb2 {temb2.shape}")

        # pass through the up-sampling path, scaling features by the context
        # embedding and shifting them by the time embedding at each level
        up1 = self.up0(hiddenvec)
        up2 = self.up1(cemb1 * up1 + temb1, down2)  # multiply by context, add timestep
        up3 = self.up2(cemb2 * up2 + temb2, down1)
        out = self.out(torch.cat((up3, x), 1))
        return out
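To see what the `cemb1*up1 + temb1` conditioning in `forward` does numerically, here is a tiny sketch with made-up toy tensors: the reshaped context embedding scales each channel of the feature maps, the reshaped time embedding shifts it, and both broadcast over the spatial dimensions.

```python
import torch

feat = torch.ones(2, 8, 4, 4)         # (batch, channels, h, w) feature maps
cemb = torch.full((2, 8, 1, 1), 2.0)  # context embedding, reshaped to broadcast
temb = torch.full((2, 8, 1, 1), 0.5)  # time embedding, reshaped to broadcast

# Same rule as cemb1 * up1 + temb1: scale by context, shift by timestep.
conditioned = cemb * feat + temb

print(conditioned.shape)              # torch.Size([2, 8, 4, 4])
print(conditioned[0, 0, 0, 0].item())  # 2.0 * 1.0 + 0.5 = 2.5
```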
# hyperparameters
# diffusion hyperparameters
timesteps = 500
beta1 = 1e-4
beta2 = 0.02
# network hyperparameters
device = torch.device("cuda:0" if torch.cuda.is_available() else torch.device('cpu'))
n_feat = 64 # 64 hidden dimension feature
n_cfeat = 5 # context vector is of size 5
height = 16 # 16x16 image
save_dir = './weights/'
# training hyperparameters
batch_size = 100
n_epoch = 32
lrate=1e-3
# construct DDPM noise schedule
b_t = (beta2 - beta1) * torch.linspace(0, 1, timesteps + 1, device=device) + beta1
a_t = 1 - b_t
ab_t = torch.cumsum(a_t.log(), dim=0).exp()
ab_t[0] = 1
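As a quick standalone sanity check on this schedule (restating the same hyperparameters so it runs on its own): the cumulative product `ab_t` should start at 1 and decay monotonically toward 0, so that by the final timestep the noised image is almost pure noise.

```python
import torch

# Same DDPM schedule as in the lab, on CPU for a standalone check
timesteps, beta1, beta2 = 500, 1e-4, 0.02
b_t = (beta2 - beta1) * torch.linspace(0, 1, timesteps + 1) + beta1
a_t = 1 - b_t
ab_t = torch.cumsum(a_t.log(), dim=0).exp()  # cumulative product via log-sum-exp
ab_t[0] = 1

# ab_t starts at 1 (no noise) and shrinks toward 0 (pure noise)
print(float(ab_t[0]), float(ab_t[-1]))
```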
# construct model
nn_model = ContextUnet(in_channels=3, n_feat=n_feat, n_cfeat=n_cfeat, height=height).to(device)
# set up optimizer
optim = torch.optim.Adam(nn_model.parameters(), lr=lrate)
# load in pretrain model weights and set to eval mode
nn_model.load_state_dict(torch.load(f"{save_dir}/context_model_trained.pth", map_location=device))
nn_model.eval()
print("Loaded in Context Model")
# helper function; removes the predicted noise (but adds some noise back in to avoid collapse)
def denoise_add_noise(x, t, pred_noise, z=None):
    if z is None:
        z = torch.randn_like(x)
    noise = b_t.sqrt()[t] * z
    mean = (x - pred_noise * ((1 - a_t[t]) / (1 - ab_t[t]).sqrt())) / a_t[t].sqrt()
    return mean + noise
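To see what one step of this helper does, here is a self-contained rerun of the same formula (the schedule is rebuilt with the lab's hyperparameters; the all-zero predicted noise and `z = 0` are contrived purely for illustration). With `pred_noise = 0` and `z = 0`, the step reduces to dividing `x` by `sqrt(a_t[t])`, a slight rescaling toward the clean image.

```python
import torch

# Rebuild the DDPM schedule so this sketch runs on its own
timesteps, beta1, beta2 = 500, 1e-4, 0.02
b_t = (beta2 - beta1) * torch.linspace(0, 1, timesteps + 1) + beta1
a_t = 1 - b_t
ab_t = torch.cumsum(a_t.log(), dim=0).exp()
ab_t[0] = 1

def denoise_add_noise(x, t, pred_noise, z=None):
    if z is None:
        z = torch.randn_like(x)
    noise = b_t.sqrt()[t] * z
    mean = (x - pred_noise * ((1 - a_t[t]) / (1 - ab_t[t]).sqrt())) / a_t[t].sqrt()
    return mean + noise

x = torch.ones(1, 4)
eps = torch.zeros(1, 4)  # contrived: pretend the predicted noise is zero
out = denoise_add_noise(x, 400, eps, z=torch.zeros(1, 4))
print(out)  # x scaled by 1 / sqrt(a_t[400]), slightly larger than 1
```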
# sample with context using standard algorithm
@torch.no_grad()
def sample_ddpm_context(n_sample, context, save_rate=20):
    # x_T ~ N(0, 1), sample initial noise
    samples = torch.randn(n_sample, 3, height, height).to(device)

    # array to keep track of generated steps for plotting
    intermediate = []
    for i in range(timesteps, 0, -1):
        print(f'sampling timestep {i:3d}', end='\r')

        # reshape time tensor
        t = torch.tensor([i / timesteps])[:, None, None, None].to(device)

        # sample some random noise to inject back in. For i = 1, don't add back in noise
        z = torch.randn_like(samples) if i > 1 else 0

        eps = nn_model(samples, t, c=context)  # predict noise e_(x_t, t, ctx)
        samples = denoise_add_noise(samples, i, eps, z)
        if i % save_rate == 0 or i == timesteps or i < 8:
            intermediate.append(samples.detach().cpu().numpy())

    intermediate = np.stack(intermediate)
    return samples, intermediate
# visualize samples with randomly selected context
plt.clf()
ctx = F.one_hot(torch.randint(0, 5, (32,)), 5).to(device=device).float()
samples, intermediate = sample_ddpm_context(32, ctx)
animation_ddpm_context = plot_sample(intermediate,32,4,save_dir, "ani_run", None, save=False)
HTML(animation_ddpm_context.to_jshtml())
def show_images(imgs, nrow=2):
    _, axs = plt.subplots(nrow, imgs.shape[0] // nrow, figsize=(4, 2))
    axs = axs.flatten()
    for img, ax in zip(imgs, axs):
        img = (img.permute(1, 2, 0).clip(-1, 1).detach().cpu().numpy() + 1) / 2
        ax.set_xticks([])
        ax.set_yticks([])
        ax.imshow(img)
    plt.show()
# user defined context
ctx = torch.tensor([
# hero, non-hero, food, spell, side-facing
[1,0,0,0,0],
[1,0,0,0,0],
[0,0,0,0,1],
[0,0,0,0,1],
[0,1,0,0,0],
[0,1,0,0,0],
[0,0,1,0,0],
[0,0,1,0,0],
]).float().to(device)
samples, _ = sample_ddpm_context(ctx.shape[0], ctx)
show_images(samples)
# mix of defined context
ctx = torch.tensor([
# hero, non-hero, food, spell, side-facing
[1,0,0,0,0], #human
[1,0,0.6,0,0],
[0,0,0.6,0.4,0],
[1,0,0,0,1],
[1,1,0,0,0],
[1,0,0,1,0]
]).float().to(device)
samples, _ = sample_ddpm_context(ctx.shape[0], ctx)
show_images(samples)
Sprites by ElvGames, FrootsnVeggies and kyrise.
This code is modified from https://github.com/cloneofsimo/minDiffusion.
The diffusion model is based on Denoising Diffusion Probabilistic Models and Denoising Diffusion Implicit Models.